####

Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.

Motivation


According to a recent report, overall tobacco use increased in youths (middle schooland high school students) in the United States in 2017 and 2018, despite previous years of declining use.

This major increase is attributed to an increase in the use of electronic cigaarette (e-cigarette) products.

E-cigarettes are referred to by many different names, including but not limited to:

  1. Electronic nicotine delivery systems (ENDS)
  2. Vapes
  3. e-hookahs
  4. vape pens
  5. tanks
  6. mods

The devices vary greatly:

[source]

See this CDC guide and the American Lung Association website for more information.

The report found that:

During 2017–2018, current use of any tobacco product increased 38.3% (from 19.6% to 27.1%) among high school students and 28.6% (from 5.6% to 7.2%) among middle school students; e-cigarette use increased 77.8% (from 11.7% to 20.8%) among high school students and 48.5% (from 3.3% to 4.9%) among middle school students.

In 2018, the Federal Drug Administration (FDA) in the United States stated that e-cigarette usage use among youth reached:

“nothing short of an epidemic proportion of growth

In this case study, we will be invistigating the same data used in the report that generated the above findings. This data comes from the The National Youth Tobacco Survey (NYTS).

Gentzke, Andrea S., Melisa Creamer, Karen A. Cullen, Bridget K. Ambrose, Gordon Willis, Ahmed Jamal, and Brian A. King. “Vital Signs: Tobacco Product Use Among Middle and High School Students - United States, 2011-2018.” MMWR. Morbidity and Mortality Weekly Report 68 (6): 157–64 (2019).

Main Questions


Our main question:

  1. How has tobacco/nicotine use by American youths changed since 2015?
  2. How do vaping rates compare between males and females?
  3. What vaping brands and flavors appear to be used the most frequently?
    We will base this on the following survey questions:
    > “During the past 30 days, what brand of e-cigarettes did you usually use?”
    >“What flavors of tobacco products have you used in the past 30 days?”

  4. Have vaping rates possibly influenced tobacco/nicotine use?

Learning Objectives


In this case study, we will cover how to import data from multiple files efficiently, how to import data from excel files, and how to make a variety of visualizations to compare multiple groups across time. We will also demonstrate how to work with codebooks. We will cover the concept of survey weighting and introduce the srvyr package. We will discuss the difference between pooled cross-sectional data and panel data. We will especially focus on using packages and functions from the Tidyverse for wrangling data, such as tidyr and dpyr and for visualization, such as as ggplot2. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.


We will begin by loading the packages that we will need:

Package Use
here to easily load and save data
readxl to import the data in the excel files
purrr to import the data in all the different excel and csv files efficiently
readr to import the CSV file data
dplyr to arrange/filter/select/compare specific subsets of the data
summarytools to get an overview of data in a different style
tidyr to rearrange data in wide and long formats

stringr | to manipulate the character strings within the data
ggplot2 | to make visualizations with multiple layers
ggpubr | to easily add regression line equations to plots
forcats | to change details about factors (categorical variables)
lmerTest| to perform linear mixed model testing
car| to perform Levene’s Test of Homogeneity of Variances
ggiraph| to make plots interactive
ggforce| to modify facets in plots
viridis| to plot in color palette
cowplot | to allow plots to be combined skimr | to get an overview of data

The first time we use a function, we will use the :: to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.

Context


According to the cited Morbidity and Mortality Weekly Report this was what was already known about this topic and the implications of this study:

Importantly, the vapors used in e-cigarettes contain harmful chemicals:

E-cigarette usage has also been associated with lung injury

See here for additonal information about the potential health effects of e-cigarettes in teens and young adults.

Limitations


There are some important considerations regarding this data analysis to keep in mind:

  1. The data included in the National Youth Tobacco Survey (NYTS) does not follow the same individual students over time. A longitudinal study that does follow the same individuals over time collects data called panel data. The data in this study is called pooled [cross-sectional data]https://en.wikipedia.org/wiki/Cross-sectional_data, this data is obtained from random collection of obervations across time.

According to wikipedia: >Panel data differs from pooled cross-sectional data across time, because it deals with the observations on the same subjects in different times whereas the latter observes different subjects in different time periods

  1. The data also includes percentages of students that reported use of particular tobacco product, but the survey questions did not ask how much uses of one product compared to another - for example the survey asked questions like:“What flavors of tobacco products have you used in the past 30 days?” Thus, it is unclear how often one flavor was used by the same individual over another.

While gender and sex are not actually binary, the dataused in this analysis only contains information for groups of individuals who answered the survey questions as male or female.

What are the data?


The data in this case study comes from the National Youth Tobacco Survey (NYTS) which is an annual survey that asks students in high school and middle school (grades 6-12) about tobacco usage in the United States of America.

The data for this survey is freely available online at this website with data from 1999, 2000, 2002, 2004, 2006, 2009, and 2011-2019. We will be using data from 2015-2019 due to the fact that these years are the most recent that asked questions recarding e-cigarette usage.

Each year includes documentation, such as a codebook and an excel file containing the data:

Therefore, since we are using data from 2015-2019, the data we are interested in is located in 5 different excel files, each with their own codebook.

The codebook contains information describing the data within the excel file.

As you can see the excel file contains very short variable names and values, and it is not clear what they mean without the codebook:

The codebook explains what the variables (the columns) are:

And the codebook explains what the values for each variable are:

We will explain more later about what the values on the right indicate.

The reason that there are codebooks for each year is because the questions asked year varied slightly.

The data in this survey is what is called pooled cross-sectional data. In otherwords, different subsets of students are surveyed each year and it is not clear which, if any , individuals particpate from one year to the next.

Data Import


Reading in the excel files


Since these excel files are so large (each has roughly 20,000 rows), it takes a bit of time for the data to load. To make the process faster, we previously imported these files, selected only our questions of interest, and saved this data as csv files.

Click here for details on how the data was originally imported

First we created a list of filenames of all the different excel files. Using the here() function of the here package, we looked in all the directories of the project. The list.files() function looked for all files with .xlsx within these subdirectories.

[1] "docs/2015-nyts-dataset-and-codebook-microsoft-excel/nyts2015.xlsx"
[2] "docs/2016-nyts-dataset-and-codebook-microsoft-excel/nyts2016.xlsx"
[3] "docs/2017-nyts-dataset-and-codebook-microsoft-excel/nyts2017.xlsx"
[4] "docs/2018-nyts-dataset-and-codebook-microsoft-excel/nyts2018.xlsx"
[5] "docs/2019-nyts-dataset-and-codebook-microsoft-excel/nyts2019.xlsx"

All the files were read using read_excel() of the readxl package. Using the map() function of the purrr package this was done efficiently for all of the excel files in the list using one command. The . is used to indicate that we want to apply the read_excel() function to the data that we just piped into the map() function.

Here will also used the %>% pipe which can be used to define the input for later sequential steps.

This created a single list of tibbles (one for each file).

Each excel file name was extracted using the str_extract() function of the stringr package. Here we are keeping occurances of the character string “nyts201” followed by a “5”,“6”,“7”,“8”, or “9”.

[1] "nyts2015" "nyts2016" "nyts2017" "nyts2018" "nyts2019"

These names became the names of the tibbles in the list of tibbles.

Specific columns were selected from each of the tibbles using the varaible name, as identified in the codebook for being of interest. In some cases functions like starts_with() of the dplyr package were used to select several variables. Most of the survey questions about tobacco use start with an "E" or a "C" according to the codebooks.

tbl_files[["nyts2015"]] <- tbl_files[["nyts2015"]] %>%
    dplyr::select(psu, #Primary Sampling Unit
                  finwgt,#Analysis Weight
                  stratum,#Sampling stratum
                  Qn1, #Age
                  Qn2, #Sex
                  Qn3, #Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  )


tbl_files[["nyts2016"]] <- tbl_files[["nyts2016"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  ) 

tbl_files[["nyts2017"]] <- tbl_files[["nyts2017"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  )

tbl_files[["nyts2018"]] <- tbl_files[["nyts2018"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  )

tbl_files[["nyts2019"]] <- tbl_files[["nyts2019"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q40, #Brand, e-cigarettes
                  Q62A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q62B, #Clove or spice
                  Q62C, #Fruit 
                  Q62D, #Chocolate
                  Q62E, #Alcoholic Drink
                  Q62F, #Candy/Desserts/Other Sweets
                  Q62G, #Some Other Flavor Not Listed Here 
                  )

A directory was created using the base dir.create() function called data_reduced for the csv files. New csv files were created for each of the tbls in the list using the write_csv() function of the readr package. This was done all at once using the base mappy() function.

Now we will show how to read in the data from the five csv files that were created from the five different excel files.

Data Exploration and Wrangling


Renaming variables


Here is what the data for 2015 looks like:

Rows: 17,711
Columns: 29
$ psu        <chr> "015438", "015438", "015438", "015438", "015438", "015438"…
$ finwgt     <dbl> 216.7268, 324.9620, 324.9620, 397.1552, 264.8745, 264.8745…
$ stratum    <chr> "BR3", "BR3", "BR3", "BR3", "BR3", "BR3", "BR3", "BR3", "B…
$ Qn1        <dbl> 10, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
$ Qn2        <dbl> 2, 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2…
$ Qn3        <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 5, 5…
$ ECIGT      <dbl> 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1…
$ ECIGAR     <dbl> 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2…
$ ESLT       <dbl> 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ EELCIGT    <dbl> 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1…
$ EROLLCIGTS <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2…
$ EFLAVCIGTS <dbl> 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ EBIDIS     <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ EFLAVCIGAR <dbl> 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 2…
$ EHOOKAH    <dbl> 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ EPIPE      <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ ESNUS      <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ EDISSOLV   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CCIGT      <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CCIGAR     <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CSLT       <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CELCIGT    <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CROLLCIGTS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CFLAVCIGTS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CBIDIS     <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CHOOKAH    <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CPIPE      <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CSNUS      <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CDISSOLV   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…

Currently, it isn’t very clear what most of the variables indicate.

We want to rename variables like Qn1 to something more meaningful like Age.

To do this we will use the rename() function of the dplyr package. The new name is always listed first before the =.

We also want to update the values for age and grade, so that they are more understandable.

Recall from the codebook for this year, that age isn’t listed in the way one might expect:

The same is true for grade:

This is why it is so important to always check the codebook!!

We also need to add new variables about usage of brands and specific flavors of e-cigrattes, because although later years have questions about this, this year unfortunately only has broad questions about flavors. Eventually we want to put the data from each year together, so we need the same variables for each year. Thus we need to add variables that just say NA for the brand and flavor data for this year.

To change the existing variables and create new variabels, we will use the mutate function of the dplyr package.

Now we can see how the data has changed:

Rows: 17,711
Columns: 38
$ psu                  <chr> "015438", "015438", "015438", "015438", "015438"…
$ finwgt               <dbl> 216.7268, 324.9620, 324.9620, 397.1552, 264.8745…
$ stratum              <chr> "BR3", "BR3", "BR3", "BR3", "BR3", "BR3", "BR3",…
$ Age                  <dbl> 10, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1…
$ Sex                  <dbl> 2, 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 1, 2, 2, 1, …
$ Grade                <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, …
$ ECIGT                <dbl> 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, …
$ ECIGAR               <dbl> 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ ESLT                 <dbl> 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, …
$ EELCIGT              <dbl> 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, …
$ EROLLCIGTS           <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, …
$ EFLAVCIGTS           <dbl> 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ EBIDIS               <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ EFLAVCIGAR           <dbl> 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, …
$ EHOOKAH              <dbl> 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, …
$ EPIPE                <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ ESNUS                <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ EDISSOLV             <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CCIGT                <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CCIGAR               <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CSLT                 <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CELCIGT              <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CROLLCIGTS           <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CFLAVCIGTS           <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CBIDIS               <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CHOOKAH              <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CPIPE                <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CSNUS                <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CDISSOLV             <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ brand_ecig           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ menthol              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ clove_spice          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ fruit                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ chocolate            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ alcoholic_drink      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ candy_dessert_sweets <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ other                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ no_use               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Updating Values

We also want to replace the value of 19 for Age to be ">18" and the value of 13 for Grade to be replaced with "Ungraded/Other" Also accroding to the codebooks, numeric values of 1 indicate a survey answer of FALSE, while a value of 2 indicates TRUE. Sex needs to be recoded, but it is a bit unclear if it is encoded slightly differently across years.

Let’s create a function to make all these updates except for Sex:

Now let’s update the Sex encoding:

It’s a bit difficult to tell if the encoding was the same across years, so let’s check using the count() function of the dplyr package.

According to the codebook, we should have:
1) 8,958 males in 2015 2) 10,438 males in 2016 3) 8,881 males in 2017
4) 10,069 males in 2018
5) 9,803 males in 2019

$nyts2015
# A tibble: 3 x 2
    Sex     n
  <dbl> <int>
1     1  8958
2     2  8622
3    NA   131

$nyts2016
# A tibble: 3 x 2
    Sex     n
  <dbl> <int>
1     1 10438
2     2 10082
3    NA   155

$nyts2017
# A tibble: 3 x 2
    Sex     n
  <dbl> <int>
1     1  8881
2     2  8815
3    NA   176

$nyts2018
# A tibble: 3 x 2
    Sex     n
  <dbl> <int>
1     1 10069
2     2  9920
3    NA   200

$nyts2019
# A tibble: 3 x 2
  Sex       n
  <chr> <int>
1 .N      116
2 1      9803
3 2      9099

Thus, it looks like males were encoded as 1 for each year.

  1. 8,958 males in 2015 where 1 = male
  2. 10,438 males in 2016 where 1 = male
  3. 8,881 males in 2017 where 1 = male
  4. 10,069 males in 2018 where 1 = male
  5. 9,803 males in 2019 where 1 = male
$nyts2015
# A tibble: 3 x 2
  Sex        n
  <fct>  <int>
1 male    8958
2 female  8622
3 <NA>     131

$nyts2016
# A tibble: 3 x 2
  Sex        n
  <fct>  <int>
1 male   10438
2 female 10082
3 <NA>     155

$nyts2017
# A tibble: 3 x 2
  Sex        n
  <fct>  <int>
1 male    8881
2 female  8815
3 <NA>     176

$nyts2018
# A tibble: 3 x 2
  Sex        n
  <fct>  <int>
1 male   10069
2 female  9920
3 <NA>     200

$nyts2019
# A tibble: 3 x 2
  Sex        n
  <fct>  <int>
1 .N       116
2 male    9803
3 female  9099

Looks good!

The years (2016-2019) that have flavors also need the flavor data to be logical:

Now there are just a few changes needed that are specific to 2019:

Great! Now we have all the same variables and our values don’t need to be handled any differently for any of the years. Thus we can combine the data across years.

# A tibble: 6 x 41
  year  psu   finwgt stratum Age   Sex   Grade ECIGT ECIGAR ESLT  EELCIGT
  <chr> <chr>  <dbl> <chr>   <fct> <fct> <fct> <lgl> <lgl>  <lgl> <lgl>  
1 nyts… 0154…   217. BR3     18    fema… 12    FALSE TRUE   FALSE FALSE  
2 nyts… 0154…   325. BR3     17    male  12    TRUE  TRUE   FALSE TRUE   
3 nyts… 0154…   325. BR3     18    male  12    FALSE FALSE  FALSE FALSE  
4 nyts… 0154…   397. BR3     18    male  12    TRUE  FALSE  FALSE TRUE   
5 nyts… 0154…   265. BR3     18    fema… 12    FALSE FALSE  FALSE FALSE  
6 nyts… 0154…   265. BR3     18    fema… 12    TRUE  FALSE  FALSE TRUE   
# … with 30 more variables: EROLLCIGTS <lgl>, EFLAVCIGTS <lgl>, EBIDIS <lgl>,
#   EFLAVCIGAR <lgl>, EHOOKAH <lgl>, EPIPE <lgl>, ESNUS <lgl>, EDISSOLV <lgl>,
#   CCIGT <lgl>, CCIGAR <lgl>, CSLT <lgl>, CELCIGT <lgl>, CROLLCIGTS <lgl>,
#   CFLAVCIGTS <lgl>, CBIDIS <lgl>, CHOOKAH <lgl>, CPIPE <lgl>, CSNUS <lgl>,
#   CDISSOLV <lgl>, brand_ecig <chr>, menthol <lgl>, clove_spice <lgl>,
#   fruit <lgl>, chocolate <lgl>, alcoholic_drink <lgl>,
#   candy_dessert_sweets <lgl>, other <lgl>, no_use <lgl>, EHTP <lgl>,
#   CHTP <lgl>

Looks like we just need to remove "nyts" from the year variable that we created from the names of the tibbles in our list.

Here is our clean and wrangled data:

# A tibble: 6 x 41
   year psu   finwgt stratum Age   Sex   Grade ECIGT ECIGAR ESLT  EELCIGT
  <dbl> <chr>  <dbl> <chr>   <fct> <fct> <fct> <lgl> <lgl>  <lgl> <lgl>  
1  2015 0154…   217. BR3     18    fema… 12    FALSE TRUE   FALSE FALSE  
2  2015 0154…   325. BR3     17    male  12    TRUE  TRUE   FALSE TRUE   
3  2015 0154…   325. BR3     18    male  12    FALSE FALSE  FALSE FALSE  
4  2015 0154…   397. BR3     18    male  12    TRUE  FALSE  FALSE TRUE   
5  2015 0154…   265. BR3     18    fema… 12    FALSE FALSE  FALSE FALSE  
6  2015 0154…   265. BR3     18    fema… 12    TRUE  FALSE  FALSE TRUE   
# … with 30 more variables: EROLLCIGTS <lgl>, EFLAVCIGTS <lgl>, EBIDIS <lgl>,
#   EFLAVCIGAR <lgl>, EHOOKAH <lgl>, EPIPE <lgl>, ESNUS <lgl>, EDISSOLV <lgl>,
#   CCIGT <lgl>, CCIGAR <lgl>, CSLT <lgl>, CELCIGT <lgl>, CROLLCIGTS <lgl>,
#   CFLAVCIGTS <lgl>, CBIDIS <lgl>, CHOOKAH <lgl>, CPIPE <lgl>, CSNUS <lgl>,
#   CDISSOLV <lgl>, brand_ecig <chr>, menthol <lgl>, clove_spice <lgl>,
#   fruit <lgl>, chocolate <lgl>, alcoholic_drink <lgl>,
#   candy_dessert_sweets <lgl>, other <lgl>, no_use <lgl>, EHTP <lgl>,
#   CHTP <lgl>

Reminder: Current users are a subset of ever users.

Data Visualization


Question 1

For lots of pivot_longer() that are the same… can use pivot_longer_spec()

spec <- relig_income %>% build_longer_spec( cols = -religion, names_to = “income”, values_to = “count” ) pivot_longer_spec(relig_income, spec)

But need to decide if we are going to create some new data frames…

Question 3

What vaping brands and flavors appear to be used the most frequently?

Huang J, Duan Z, Kwok J, et al. Tob Control 2019;28:146–151.

Huang J, Duan Z, Kwok J, et al. Tob Control 2019;28:146–151.

Paper

plot5 <- nyts_data %>%
  filter(year!=2015) %>%
  filter(menthol==TRUE|
           clove_spice==TRUE|
           fruit==TRUE|
           chocolate==TRUE|
           alcoholic_drink==TRUE|
           candy_dessert_sweets==TRUE|
           other==TRUE) %>%
  mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  mutate(ecig_only_ever = case_when(ecig_ever == TRUE &
                                      non_ecig_ever ==FALSE ~ TRUE,
                                    TRUE ~ FALSE),
           ecig_only_current = case_when(ecig_current == TRUE &
                                           non_ecig_ever ==FALSE ~ TRUE,
                                    TRUE ~ FALSE),
           non_ecig_only_ever = case_when(non_ecig_ever == TRUE &
                                            ecig_ever ==FALSE ~ TRUE,
                                    TRUE ~ FALSE),
           non_ecig_only_current = case_when(non_ecig_current == TRUE &
                                               ecig_ever ==FALSE ~ TRUE,
                                    TRUE ~ FALSE)) %>%
  mutate(Group = case_when(ecig_only_ever==TRUE |
                             ecig_only_current==TRUE ~ "Only e-cigarettes",
                         non_ecig_only_ever==TRUE |
                           non_ecig_only_current==TRUE ~ "Only other products",
                                    TRUE ~ "Both")) %>%
  filter(Group!="Both") %>%
  group_by(year, Group) %>%
  summarise(`Menthol`=(sum(menthol, na.rm = TRUE)*100)/
              sum(!is.na(menthol)),
              `Clove or Spice`=(sum(clove_spice, na.rm = TRUE)*100)/
              sum(!is.na(clove_spice)),
              `Fruit`=(sum(fruit, na.rm = TRUE)*100)/sum(!is.na(fruit)),
              `Chocolate`=(sum(chocolate, na.rm = TRUE)*100)/
              sum(!is.na(chocolate)),
              `Alcoholic Drink`=(sum(alcoholic_drink, na.rm = TRUE)*100)/
              sum(!is.na(alcoholic_drink)),
              `Candy/Desserts/Sweets`=(sum(candy_dessert_sweets, na.rm = TRUE)*100)/
              sum(!is.na(candy_dessert_sweets)),
            `Other`=(sum(other, na.rm = TRUE)*100)/
              sum(!is.na(other)),
            Respondents=n()) %>%
  #converting all columns between and including Menthol and Other to one column called Flavor
  pivot_longer(cols = Menthol:Other, names_to = "Flavor", values_to = "Percentage of students") %>%
  filter(!is.na(`Percentage of students`),
         Flavor!="Other") %>%
  group_by(Flavor) %>%
  mutate(affirmative=(Respondents * `Percentage of students`)/100) %>%
  mutate(flavor_mean = sum(affirmative)/sum(Respondents)) %>%
  ungroup() %>%
  mutate(flavor_mean_rank = dense_rank(flavor_mean),
         Flavor = fct_reorder(Flavor, flavor_mean_rank)) %>%
  ggplot(aes(x=year, y=`Percentage of students`, color=Group)) +
  facet_wrap(.~Flavor,ncol=3) +
  geom_line() + 
  geom_point(show.legend = FALSE) + 
  theme_minimal() +
  theme(legend.position = "bottom",
          axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90)) + 
  labs(title = "Among users of only one type of product, what vaping flavors appear to be used the most frequently?",
       subtitle = "Percent reporting only e-cigarette use vs only other nicotine product use")

plot5

Question 4

Have vaping rates possibly influenced tobacco/nicotine use?

plot7 <- nyts_data %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
    group_by(year) %>%
    summarise(ecig_ever_year=(sum(ecig_ever, na.rm = TRUE)*100)/
                sum(!is.na(ecig_ever)),
              ecig_current_year=(sum(ecig_current, na.rm = TRUE)*100)/
                sum(!is.na(ecig_current)),
              non_ecig_ever_year=(sum(non_ecig_ever, na.rm = TRUE)*100)/
                sum(!is.na(non_ecig_ever)),
              non_ecig_current_year=(sum(non_ecig_current, na.rm = TRUE)*100)/
                sum(!is.na(non_ecig_current))) %>%
    pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    mutate(User = case_when(Category =="ecig_ever_year" ~ "Ever",
                           Category =="non_ecig_ever_year" ~ "Ever",
                           Category =="ecig_current_year" ~ "Current",
                           Category =="non_ecig_current_year" ~ "Current")) %>%
    mutate(Product = case_when(Category =="ecig_ever_year" ~ "E-cigarettes",
                           Category =="non_ecig_ever_year" ~ "Other products",
                           Category =="ecig_current_year" ~ "E-cigarettes",
                           Category =="non_ecig_current_year" ~ "Other products")) %>%
    filter(User=="Ever") %>%
    ggplot(aes(x=year,y=`Percentage of students`, color=Product)) +
    geom_line(linetype=1) + # geom_bar(stat="identity", position = "dodge", color="black") +
  geom_point(show.legend = FALSE) +
  scale_y_continuous(breaks = seq(10, 60, by = 10), limits = c(10,60)) +
    theme_minimal() +
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How does e-cigarette ever use compare to ever use of other products over the years?",
         subtitle = "E-cigarette and non-e-cigarette use",
         y = "% of students")

plot7

plot8 <- nyts_data %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
    group_by(year) %>%
    summarise(ecig_ever_year=(sum(ecig_ever, na.rm = TRUE)*100)/
                sum(!is.na(ecig_ever)),
              ecig_current_year=(sum(ecig_current, na.rm = TRUE)*100)/
                sum(!is.na(ecig_current)),
              non_ecig_ever_year=(sum(non_ecig_ever, na.rm = TRUE)*100)/
                sum(!is.na(non_ecig_ever)),
              non_ecig_current_year=(sum(non_ecig_current, na.rm = TRUE)*100)/
                sum(!is.na(non_ecig_current))) %>%
    pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year) %>%
    mutate(User = case_when(Category =="ecig_ever_year" ~ "Ever",
                           Category =="non_ecig_ever_year" ~ "Ever",
                           Category =="ecig_current_year" ~ "Current",
                           Category =="non_ecig_current_year" ~ "Current")) %>%
    mutate(Product = case_when(Category =="ecig_ever_year" ~ "E-cigarettes",
                           Category =="non_ecig_ever_year" ~ "Other products",
                           Category =="ecig_current_year" ~ "E-cigarettes",
                           Category =="non_ecig_current_year" ~ "Other products")) %>%
    ggplot(aes(x=year,y=`Percentage of students`, color=Product, linetype=User)) +
    geom_line() +
  geom_point(show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
    theme_minimal() +
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How does e-cigarette use compare to use of other products over the years?",
         subtitle = "E-cigarette and non-e-cigarette use",
         y = "% of students")

plot8

Weighted Sample

plotA_w <- nyts_data %>%
    mutate(tobacco_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE),
           tobacco_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(tobacco_ever = case_when(tobacco_sum_ever > 0 ~ TRUE,
                                    tobacco_sum_ever ==0 ~ FALSE),
           tobacco_current = case_when(tobacco_sum_current > 0 ~ TRUE,
                                    tobacco_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt, nest=TRUE) %>%
    group_by(year) %>%
  summarise(tobacco_ever_year = survey_mean(tobacco_ever, vartype = "ci", na.rm=TRUE),
            tobacco_current_year = survey_mean(tobacco_current, vartype = "ci", na.rm=TRUE))  %>%
  mutate_at(vars(-year), "*", 100) %>%
    pivot_longer(cols = -year, names_to = "Type", values_to = "Percentage of students") %>%
    # gather(key=Type,
    #        value=`Percentage of students`,
    #        -year) %>%
  mutate(Estimate = case_when(grepl("_low", Type) ~ "Lower",
                          grepl("_upp", Type) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Type) ~ "Ever",
                          grepl("current", Type) ~ "Current",
                          TRUE ~ "Mean")) %>%
  dplyr::select(-Type) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
  ggplot(aes(x=year,y=Mean)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
    scale_y_continuous(breaks = seq(0,70,by=10),
                       labels = seq(0,70,by=10),
                       limits = c(0,70)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "Nicotine product users more prevalent after 2017",
         y = "% of students")

plotB_w <- nyts_data %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt, nest=TRUE) %>%
    group_by(year) %>%
    summarise(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_at(vars(-year), "*", 100) %>%
  dplyr::select(year,
                ecig_ever_year,
                ecig_current_year,
                non_ecig_ever_year,
                non_ecig_current_year,
                contains("low"),
                contains("upp")) %>%
  pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year)  %>%
  mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("current", Category) ~ "Current",
                          TRUE ~ "Ever",),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
  filter(User=="Ever") %>%
  dplyr::rename("Lower_temp" = Upper,
                "Upper_temp" = Lower) %>%
  dplyr::rename("Lower"=Lower_temp,
                "Upper"=Upper_temp) %>%
    ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(linetype=1) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), show.legend = FALSE) +
  scale_y_continuous(breaks = seq(10, 60, by = 10), limits = c(10,60)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% ever trying e-cigarettes increases &\n% ever trying other products decreases",
         y = "% of students")

#### the wrangling looks the same as the above plot...
plotC_w <- nyts_data %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt, nest=TRUE) %>%
    group_by(year) %>%
  summarise(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_at(vars(-year), "*", 100) %>%
  dplyr::select(year,
                ecig_ever_year,
                ecig_current_year,
                non_ecig_ever_year,
                non_ecig_current_year,
                contains("low"),
                contains("upp")) %>%
  pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year) %>%
    mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Category) ~ "Ever",
                          grepl("current", Category) ~ "Current"),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
    ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% Using e-cigarettes increases &\n% using Other products decreases",
         y = "% of students")

title_w <- ggdraw() + 
  draw_label(
    expression("Have vaping rates possibly influenced tobacco/nicotine use?"),
    fontface = 'bold',
    size=14,
    x = 0,
    hjust = 0
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_w <- plot_grid(plotA_w,
                     rel_widths = c(1),
                     align = "v",
                     axis = "bt")
plotsBC_w <- plot_grid(plotB_w,
                     plotC_w,
                     rel_widths = c(1,1),
                     align = "v",
                     axis = "bt")

legend_w <- get_legend(plotB_w +
                       theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

figure_w <- plot_grid(title_w,
                      plotsA_w,
                      plotsBC_w,
                      legend_w,
                      ncol = 1,
                      rel_heights = c(0.1,
                                      1,
                                      1,
                                      0.1),
                      scale = 1.0)

figure_w

Hypothethical Cohort

plotA_w_8 <- nyts_data %>%
  filter((Grade == "8" & year == 2015) |
         (Grade == "9" & year == 2016) |
         (Grade == "10" & year == 2017) |
         (Grade == "11" & year == 2018) |
          (Grade == "12" & year == 2019) 
         ) %>%
    mutate(tobacco_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE),
           tobacco_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(tobacco_ever = case_when(tobacco_sum_ever > 0 ~ TRUE,
                                    tobacco_sum_ever ==0 ~ FALSE),
           tobacco_current = case_when(tobacco_sum_current > 0 ~ TRUE,
                                    tobacco_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt) %>%
    group_by(year) %>%
  summarise(tobacco_ever_year = survey_mean(tobacco_ever, vartype = "ci", na.rm=TRUE),
            tobacco_current_year = survey_mean(tobacco_current, vartype = "ci", na.rm=TRUE))  %>%
  mutate_at(vars(-year), "*", 100) %>%
    pivot_longer(cols = -year, names_to = "Type", values_to = "Percentage of students")%>%
    # gather(key=Type,
    #        value=`Percentage of students`,
    #        -year) %>%
  mutate(Estimate = case_when(grepl("_low", Type) ~ "Lower",
                          grepl("_upp", Type) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Type) ~ "Ever",
                          grepl("current", Type) ~ "Current",
                          TRUE ~ "Mean")) %>%
  dplyr::select(-Type) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
  ggplot(aes(x=year,y=Mean)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper)) + 
  scale_linetype_manual(values = c(2,1)) +
    scale_y_continuous(breaks = seq(0,70,by=10),
                       labels = seq(0,70,by=10),
                       limits = c(0,70)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "Nicotine product users becoming increasingly prevalent",
         y = "% of students")

plotB_w_8 <- nyts_data %>%
  filter((Grade == "8" & year == 2015) |
         (Grade == "9" & year == 2016) |
         (Grade == "10" & year == 2017) |
         (Grade == "11" & year == 2018) |
          (Grade == "12" & year == 2019) 
         ) %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt) %>%
    group_by(year) %>%
    summarise(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_at(vars(-year), "*", 100) %>%
  dplyr::select(year,
                ecig_ever_year,
                ecig_current_year,
                non_ecig_ever_year,
                non_ecig_current_year,
                contains("low"),
                contains("upp")) %>%
   pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students")%>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year)  %>%
  mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("current", Category) ~ "Current",
                          TRUE ~ "Ever",),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
  filter(User=="Ever") %>%
  dplyr::rename("Lower_temp" = Upper,
                "Upper_temp" = Lower) %>%
  dplyr::rename("Lower"=Lower_temp,
                "Upper"=Upper_temp) %>%
  ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(linetype=1) +
  geom_linerange(aes(ymin = Lower, ymax = Upper)) + 
  scale_y_continuous(breaks = seq(10, 60, by = 10), limits = c(10,60)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% ever trying nicotine products increases",
         y = "% of students")

plotC_w_8 <- nyts_data %>%
  filter((Grade == "8" & year == 2015) |
         (Grade == "9" & year == 2016) |
         (Grade == "10" & year == 2017) |
         (Grade == "11" & year == 2018) |
          (Grade == "12" & year == 2019) 
         ) %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt) %>%
    group_by(year) %>%
  summarise(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_at(vars(-year), "*", 100) %>%
  dplyr::select(year,
                ecig_ever_year,
                ecig_current_year,
                non_ecig_ever_year,
                non_ecig_current_year,
                contains("low"),
                contains("upp")) %>%
  pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students")%>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year) %>%
    mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Category) ~ "Ever",
                          grepl("current", Category) ~ "Current"),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
    ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper)) + 
  scale_linetype_manual(values = c(2,1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "E-cigarette use surpasses use of other nicotine products",
         y = "% of students")

title_w_8 <- ggdraw() + 
  draw_label(
    expression("Among"~8^th~"graders in 2015, have vaping rates possibly influenced tobacco/nicotine use?"),
    fontface = 'bold',
    size=14,
    x = 0,
    hjust = 0
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_w_8 <- plot_grid(plotA_w_8,
                        rel_widths = c(1),
                        align = "v",
                        axis = "bt")

plotsBC_w_8 <- plot_grid(plotB_w_8,
                         plotC_w_8,
                         rel_widths = c(1,1),
                         axis = "bt")

legend_w_8 <- get_legend(plotB_w_8 +
                       theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

figure_w_8 <- plot_grid(title_w_8,
                        plotsA_w_8,
                        plotsBC_w_8,
                        legend_w_8,
                        ncol = 1,
                        rel_heights = c(0.1,
                                      1,
                                      1,
                                      0.1),
                        scale = 1.0
)

figure_w_8

Final Figure

title_final <- ggdraw() +
  draw_label(
    expression("Have vaping rates possibly influenced tobacco/nicotine use?"),
    fontface = 'bold',
    size=16,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_uw_final <- ggdraw() + 
  draw_label(
    expression(~6^th~"-"~12^th~"graders, unweighted"),
    size=12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_w_final <- ggdraw() + 
  draw_label(
    expression(~6^th~"-"~12^th~"graders, weighted"),
    fontface = 'bold',
    size=12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_w_8_final <- ggdraw() + 
  draw_label(
    expression(~8^th~"graders in 2015, weighted"),
    fontface = 'bold',
    size=12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_final <- plot_grid(subtitle_uw_final,
                            subtitle_w_final,
                            subtitle_w_8_final,
                            ncol = 3)

plotsA_title_final <- ggdraw() + 
  draw_label(
    expression("Prevalence of e-cigarette use by user type"),
    size=14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_final <- plot_grid(plotA_uw + theme(plot.title = element_blank()),
                          plotA_w + theme(plot.title = element_blank()),
                          plotA_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

plotsB_title_final <- ggdraw() + 
  draw_label(
    expression("Prevalence of ever use by product type"),
    size=14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsB_final <- plot_grid(plotB_uw + theme(plot.title = element_blank()),
                          plotB_w + theme(plot.title = element_blank()),
                          plotB_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

plotsC_title_final <- ggdraw() + 
  draw_label(
    expression("Prevalence of nicotine product use by product & user type"),
    size=14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsC_final <- plot_grid(plotC_uw + theme(plot.title = element_blank()),
                          plotC_w + theme(plot.title = element_blank()),
                          plotC_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

legend_final <- get_legend(plotB_w +
                             theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

final_plot <- plot_grid(title_final,
          plotsA_title_final,
          subtitle_final,
          plotsA_final,
          plotsB_title_final,
          subtitle_final,
          plotsB_final,
          plotsC_title_final,
          subtitle_final,
          plotsC_final,
          legend_final,
          ncol = 1,
          rel_heights = c(0.2,
                          0.2,
                          0.1,
                          1,
                          0.2,
                          0.1,
                          1,
                          0.2,
                          0.1,
                          1,
                          0.1))

final_plot

Suggested Homework


  • Apply survey weights to one of the figures produced in this case study in which weighted estimates were not produced. Include error bars in the updated figure.
    • Does the figure change after the application of survey weights?
    • If so, describe how.
  • Reproduce final_plot above for a different cohort of your choice.

Notes

Ever and current variables are limited to those shared by all years of data included in this case study.

New code: Knit time: 35.214 secs. Previous code: ~ 3 m.

Problems

I had difficulty producing a plot that succinctly presented a trend. It’s very easy to produce plots that are very useful once one is familiar with the data. Some plots, however, cannot stand alone and need additional context to be clear for those without prior knowledge of the data. When I first shared a plot I had been working on with others, it became clear that in my effort to present a complicated idea briefly I had left out information that would make the trend easily interpretable. To solve this issue, I began to present visualizations of the data alongside my original plot. The final figure I created contained several additional plots, each presenting the same trend at a different level than my initial plot.

My “centerpiece” plot is the middle plot in final_plot. The 8 plots around it help provide a very clear picture of what is going on in the US with regards to e-cigarette use and nicotine product use at large. On its own, it’s difficult to understand the trends in the US and how important the weighting scheme is for inference. Once you add the left and right columns, it’s clear what is going on.

Data Analysis


content header


Summary


Session info


R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] cowplot_1.0.0      ggplot2_3.3.1      srvyr_0.3.10       summarytools_0.9.6
 [5] readr_1.3.1        forcats_0.5.0      tidyr_1.1.0        stringr_1.4.0     
 [9] dplyr_1.0.0        purrr_0.3.4        readxl_1.3.1       knitr_1.28        
[13] here_0.1          

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0   xfun_0.14          pander_0.6.3       mitools_2.4       
 [5] splines_4.0.1      lattice_0.20-41    tcltk_4.0.1        colorspace_1.4-1  
 [9] vctrs_0.3.0        generics_0.0.2     htmltools_0.4.0    yaml_2.2.1        
[13] base64enc_0.1-3    utf8_1.1.4         survival_3.1-12    rlang_0.4.6       
[17] pillar_1.4.4       glue_1.4.1         withr_2.2.0        DBI_1.1.0         
[21] pryr_0.1.4         matrixStats_0.56.0 lifecycle_0.2.0    plyr_1.8.6        
[25] munsell_0.5.0      gtable_0.3.0       cellranger_1.1.0   codetools_0.2-16  
[29] evaluate_0.14      labeling_0.3       fansi_0.4.1        highr_0.8         
[33] Rcpp_1.0.4.6       backports_1.1.7    scales_1.1.1       checkmate_2.0.0   
[37] magick_2.3         farver_2.0.3       rapportools_1.0    hms_0.5.3         
[41] digest_0.6.25      stringi_1.4.6      survey_4.0         rprojroot_1.3-2   
[45] grid_4.0.1         cli_2.0.2          tools_4.0.1        magrittr_1.5      
[49] tibble_3.0.1       crayon_1.3.4       pkgconfig_2.0.3    ellipsis_0.3.1    
[53] Matrix_1.2-18      lubridate_1.7.8    assertthat_0.2.1   rmarkdown_2.2     
[57] R6_2.4.1           compiler_4.0.1    
---
title: "Open Case Studies : Vaping Behaviors in American Youth"
author: "Michael Ontiveros, Carrie Wright, PhD. "

css: style.css
output:
  html_document:
    self_contained: yes
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes

---
<style>
#TOC {
  background: url("https://opencasestudies.github.io/img/logo.jpg");
  background-size: contain;
  padding-top: 240px !important;
  background-repeat: no-repeat;
}
</style>

```{r, echo=FALSE}
knit_time_start <- Sys.time()
```

```{r, echo=FALSE}
knitr::opts_chunk$set(fig.width=10, fig.height=8, dpi=300) 
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE, comment = NA, echo = TRUE,
                      message = FALSE, warning = FALSE, cache = FALSE,
                      fig.align = "center", out.width = '90%')
library(here)
library(knitr)
```


#### {.outline }
```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "final_plot.png"))
```
####




## {.disclaimer_block}

**Disclaimer**: The purpose of the [Open Case Studies](https://opencasestudies.github.io){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts. 

## **Motivation**
*** 
According to a recent [report](https://www.cdc.gov/mmwr/volumes/68/wr/mm6806e1.htm?s_cid=mm6806e1_w){target="_blank"}, overall tobacco use **increased** in youths (middle schooland high school students) in the United States in 2017 and 2018, despite previous years of declining use.

This major increase is attributed to an increase in the use of electronic cigaarette (e-cigarette) products.

E-cigarettes are referred to by many different names, including but not limited to:

1) Electronic nicotine delivery systems (ENDS)
2) Vapes
3) e-hookahs
4) vape pens
5) tanks
6) mods

The devices vary greatly:

```{r, echo = FALSE, fig.align ="center"}

include_graphics("https://www.lung.org/getmedia/8ac8ab8c-e7fc-497b-8384-441615f50ff0/ecigs_K.jpg.jpg")
```

##### [[source](https://www.lung.org/quit-smoking/e-cigarettes-vaping/lung-health)]

See this [CDC guide](https://www.cdc.gov/tobacco/basic_information/e-cigarettes/pdfs/ecigarette-or-vaping-products-visual-dictionary-508.pdf){target="_blank"} and the [American Lung Association website](https://www.lung.org/quit-smoking/e-cigarettes-vaping/lung-health){target="_blank"} for more information. 

The report found that:

> During 2017–2018, current use of any tobacco product **increased 38.3%** (from 19.6% to 27.1%) among high school students and **28.6%** (from 5.6% to 7.2%) among middle school students; e-cigarette use **increased 77.8%** (from 11.7% to 20.8%) among high school students and **48.5%** (from 3.3% to 4.9%) among middle school students.


In 2018, the [Federal Drug Administration (FDA) in the United States](https://acsjournals.onlinelibrary.wiley.com/doi/full/10.1002/cncr.31868){target="_blank"} stated that e-cigarette usage use among youth reached:  

> “nothing short of an **epidemic proportion of growth**”


In this case study, we will be invistigating the same data used in the report that generated the above findings. This data comes from the [The National Youth Tobacco Survey (NYTS)](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/index.htm){target="_blank"}.

#### {.reference_block}

Gentzke, Andrea S., Melisa Creamer, Karen A. Cullen, Bridget K. Ambrose, Gordon Willis, Ahmed Jamal, and Brian A. King.  “Vital Signs: Tobacco Product Use Among Middle and High School Students - United States, 2011-2018.” **MMWR. Morbidity and Mortality Weekly Report** 68 (6): 157–64 (2019).

####


## **Main Questions**
*** 

#### {.main_question_block}
<b><u> Our main question: </u></b>

1) How has tobacco/nicotine use by American youths changed since 2015? 
2) How do vaping rates compare between males and females?
3) What vaping brands and flavors appear to be used the most frequently?  
We will base this on the following survey questions:   
> "During the past 30 days, what brand of e-cigarettes did you usually use?"   
>"What flavors of tobacco products have you used in the past
30 days?"  

4) Have vaping rates possibly influenced tobacco/nicotine use?

####


## **Learning Objectives** 
*** 

In this case study, we will cover how to import data from multiple files efficiently, how to import data from excel files, and how to make a variety of visualizations to compare multiple groups across time. We will also demonstrate how to work with codebooks. We will cover the concept of survey weighting and introduce the `srvyr` package. We will discuss the difference between pooled cross-sectional data and panel data. We will especially focus on using packages and functions from the [`Tidyverse`](https://www.tidyverse.org/){target="_blank"} for wrangling data, such as `tidyr` and `dpyr` and for visualization, such as as `ggplot2`. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.


```{r, out.width = "20%", echo = FALSE, fig.align ="center"}

include_graphics("https://tidyverse.tidyverse.org/logo.png")
```


*** 


We will begin by loading the packages that we will need:

```{r}
library(here)
library(readxl)
library(purrr)
library(dplyr)
library(stringr)
library(tidyr)
library(forcats)
library(readr)
library(summarytools)
library(srvyr)
library(ggplot2)
library(cowplot)
```

 Package   | Use                                                                         
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data  
[readxl](https://readxl.tidyverse.org/){target="_blank"}      | to import the data in the excel files  
[purrr](https://purrr.tidyverse.org/){target="_blank"}   | to import the data in all the different excel and csv files efficiently
[readr](https://readr.tidyverse.org/){target="_blank"}      | to import the CSV file data
[dplyr](https://dplyr.tidyverse.org/){target="_blank"}      | to arrange/filter/select/compare specific subsets of the data  
[summarytools](https://cran.r-project.org/web/packages/skimr/index.html){target="_blank"}      | to get an overview of data in a different style   
[tidyr](https://tidyr.tidyverse.org/){target="_blank"}      | to rearrange data in wide and long formats 


[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"}    | to manipulate the character strings within the data   
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"}    | to make visualizations with multiple layers  
[ggpubr](https://cran.r-project.org/web/packages/ggpubr/index.html){target="_blank"}    | to easily add regression line equations to plots  
[forcats](https://forcats.tidyverse.org/){target="_blank"}    | to change details about factors (categorical variables)  
[lmerTest](https://cran.r-project.org/web/packages/lmerTest/lmerTest.pdf)| to perform linear mixed model testing   
[car](https://cran.r-project.org/web/packages/car/car.pdf)| to perform Levene's Test of Homogeneity of Variances   
[ggiraph](https://cran.r-project.org/web/packages/ggiraph/index.html)| to make plots interactive   
[ggforce](https://cran.r-project.org/web/packages/ggforce/ggforce.pdf)| to modify facets in plots  
[viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html)| to plot in color palette    
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to allow plots to be combined   [skimr](https://cran.r-project.org/web/packages/skimr/index.html){target="_blank"}      | to get an overview of data   




The first time we use a function, we will use the `::` to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.


## **Context**
*** 

According to the cited [Morbidity and Mortality Weekly Report](https://www.cdc.gov/mmwr/volumes/68/wr/mm6806e1.htm?s_cid=mm6806e1_w) this was what was already known about this topic and the implications of this study:

```{r, echo = FALSE, fig.align ="center", out.width = "800 px"}

knitr::include_graphics(here::here("img", "context.png"))

```


Importantly, the vapors used in e-cigarettes contain harmful chemicals:

```{r, echo = FALSE, fig.align ="center"}

include_graphics("https://www.cdc.gov/tobacco/basic_information/e-cigarettes/images/e-cigarette-aerosol-can-contain-harmful-ingredients-desktop-700.jpg")
```

E-cigarette usage has also been associated with [lung injury]((https://www.frontiersin.org/articles/10.3389/fphar.2019.01619/full)){target="_blank"}

```{r, echo = FALSE, fig.align ="center", out.width = "800 px"}

knitr::include_graphics(here::here("img", "lung.png"))
```

See [here](https://www.cdc.gov/tobacco/basic_information/e-cigarettes/Quick-Facts-on-the-Risks-of-E-cigarettes-for-Kids-Teens-and-Young-Adults.html){target="_blank"} for additonal information about the potential health effects of e-cigarettes in teens and young adults.

## **Limitations**
*** 
There are some important considerations regarding this data analysis to keep in mind: 

1) The data included in the [National Youth Tobacco Survey (NYTS)](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/index.htm){target="_blank"} does not follow the same individual students over time.  A [longitudinal study](https://www.bmj.com/about-bmj/resources-readers/publications/epidemiology-uninitiated/7-longitudinal-studies){target="_blank"} that does follow the same individuals over time collects data called [panel data](https://en.wikipedia.org/wiki/Panel_data). The data in this study is called pooled [cross-sectional data]https://en.wikipedia.org/wiki/Cross-sectional_data, this data is obtained from random collection of obervations across time.

According to wikipedia:
>Panel data differs from pooled cross-sectional data across time, because it deals with the observations on the same subjects in different times whereas the latter observes different subjects in different time periods

2) The data also includes percentages of students that reported use of particular tobacco product, but the survey questions did not ask how much uses of one product compared to another - for example the survey asked questions like:"What flavors of tobacco products have you used in the past 30 days?" Thus, it is unclear how often one flavor was used by the same individual over another.

While [gender](https://www.genderspectrum.org/quick-links/understanding-gender/){target="_blank"} and [sex](https://www.who.int/genomics/gender/en/index1.html){target="_blank"} are not actually binary, the dataused in this analysis only contains information for groups of individuals who answered the survey questions as male or female. 

## **What are the data?**
*** 
 
The data in this case study comes from the [National Youth Tobacco Survey (NYTS)](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/index.htm){target="_blank"} which is an annual survey that asks students  in high school and middle school (grades 6-12) about tobacco usage in the United States of America.

The data for this survey is freely available online at this [website](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/data/index.html){target="_blank"} with data from 1999, 2000, 2002, 2004, 2006, 2009,  and 2011-2019. We will be using data from **2015-2019** due to the fact that these years are the most recent that asked questions recarding e-cigarette usage.

Each year includes documentation, such as a [codebook](https://www.icpsr.umich.edu/icpsrweb/content/shared/ICPSR/faqs/what-is-a-codebook.html) and an excel file containing the data:

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "data.png"))
```
Therefore, since we are using data from **2015-2019**, the data we are interested in is located in 5 different excel files, each with their own [codebook](https://www.icpsr.umich.edu/icpsrweb/content/shared/ICPSR/faqs/what-is-a-codebook.html).

The [codebook](https://www.icpsr.umich.edu/icpsrweb/content/shared/ICPSR/faqs/what-is-a-codebook.html) contains information describing the data within the excel file. 

As you can see the excel file contains very short variable names and values, and it is not clear what they mean without the codebook:

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "excel.png"))
```

The codebook explains what the variables (the columns) are:
```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "variables.png"))
```

And the codebook explains what the values for each variable are:

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "qn1.png"))
```

We will explain more later about what the values on the right indicate.

The reason that there are codebooks for each year is because the questions asked year varied slightly.


The data in this survey is what is called pooled cross-sectional data. In otherwords, different subsets of students are surveyed each year and it is not clear which, if any , individuals particpate from one year to the next.

## **Data Import**
*** 

### Reading in the excel files
***
Since these excel files are so large (each has roughly 20,000 rows), it takes a bit of time for the data to load. To make the process faster, we previously imported these files, selected only our questions of interest, and saved this data as csv files. 

<details><summary> Click here for details on how the data was originally imported </summary>

First we created a list of filenames of all the different excel files. Using the `here()` function of the `here` package, we looked in all the directories of the project.
The `list.files()` function looked for all files with .xlsx within these subdirectories.
```{r}
excel_files<-list.files(here::here(), recursive = TRUE,
                  pattern = "*.xlsx")
excel_files
```

All the files were read using `read_excel()` of the `readxl` package. Using the `map()` function of the `purrr` package this was done efficiently for all of the excel files in the list using one command. The `.` is used to indicate that we want to apply the `read_excel()` function to the data that we just piped into the `map()` function.

Here will also used the `%>%` pipe which can be used to define the input for later sequential steps. 

This created a single list of tibbles (one for each file). 
```{r, eval = FALSE}
tbl_files <- excel_files %>% 
       map(~readxl::read_excel(.))
```

Each excel file name was extracted using the `str_extract()` function of the `stringr` package. Here we are keeping occurances of the character string "nyts201" followed by a "5","6","7","8", or "9".
```{r}
tbl_names <- excel_files %>%
  str_extract("nyts[2][0][1][5-9]")

tbl_names
```

These names became the names of the tibbles in the list of tibbles.
```{r, eval = FALSE}
names(tbl_files) <- tbl_names
```


Specific columns were selected from each of the tibbles using the varaible name, as identified in the codebook for being of interest.
In some cases functions like `starts_with()` of the `dplyr` package were used to select several variables. Most of the survey questions about tobacco use start with an `"E"` or a `"C"` according to the codebooks. 

```{r, eval = FALSE}

tbl_files[["nyts2015"]] <- tbl_files[["nyts2015"]] %>%
    dplyr::select(psu, #Primary Sampling Unit
                  finwgt,#Analysis Weight
                  stratum,#Sampling stratum
                  Qn1, #Age
                  Qn2, #Sex
                  Qn3, #Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  )


tbl_files[["nyts2016"]] <- tbl_files[["nyts2016"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  ) 

tbl_files[["nyts2017"]] <- tbl_files[["nyts2017"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  )

tbl_files[["nyts2018"]] <- tbl_files[["nyts2018"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  )

tbl_files[["nyts2019"]] <- tbl_files[["nyts2019"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q40, #Brand, e-cigarettes
                  Q62A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q62B, #Clove or spice
                  Q62C, #Fruit 
                  Q62D, #Chocolate
                  Q62E, #Alcoholic Drink
                  Q62F, #Candy/Desserts/Other Sweets
                  Q62G, #Some Other Flavor Not Listed Here 
                  )

```

A directory was created using the base `dir.create()` function called `data_reduced` for the csv files. 
New csv files were created for each of the tbls in the list using the `write_csv()` function of the `readr` package.
This was done all at once using the base `mappy()` function.

```{r, eval = FALSE}
# 
dir.create("docs/data_reduced")

mapply(write_csv, tbl_files, path=here(paste0("docs/data_reduced/",
                                  names(tbl_files),
                                   '.csv'))
     )

#trying to use map...
tbl_files %>%
map(~write_csv(. , path=paste0("docs/data_reduced2/nyts", pull(year)[[1]], ",csv")))


    tbl_files %>%
map(~write_csv(. , path=paste0("docs/data_reduced2/nyts", names(.), ",csv")))

   
```

</details>



Now we will show how to read in the data from the five csv files that were created from the five different excel files.

### Reading in the CSV files
***

```{r, echo=FALSE, include=FALSE}
start_time <- Sys.time()

end_time <- Sys.time()

test_time <- end_time - start_time

time_message <- paste("Duration of data import:",
      round(as.numeric(test_time) ,3),
      units(test_time)
      )
```


```{r}
nyts_data <- list.files(here::here(),recursive = TRUE,
                   pattern = "*.csv")%>%
  map(~read_csv(.))


nyts_data_names <- list.files(recursive = TRUE,
                         pattern = "*.csv")%>%
  str_extract("nyts[2][0][1][5-9]")

names(nyts_data) <- nyts_data_names
```


## **Data Exploration and Wrangling**
*** 


### Renaming variables  
*** 

Here is what the data for 2015 looks like:

#### {.scrollable }

```{r}
glimpse(nyts_data[["nyts2015"]])
```
####

Currently, it isn't very clear what most of the variables indicate.

We want to rename variables like `Qn1` to something more meaningful like `Age`.

To do this we will use the `rename()` function of the `dplyr` package. The new name is always listed first before the `=`.
```{r}

nyts_data[["nyts2015"]] <- nyts_data[["nyts2015"]] %>%
    dplyr::rename(Age=Qn1,
           Sex=Qn2,
           Grade=Qn3)
```


We also want to update the values for age and grade, so that they are more understandable. 

Recall from the codebook for this year, that age isn't listed in the way one might expect:

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "qn1.png"))
```

The same is true for grade:

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "grade.png"))
```

**This is why it is so important to always check the codebook!!**

We also need to add new variables about usage of brands and specific flavors of e-cigrattes, because although later years have questions about this, this year unfortunately only has broad questions about flavors. Eventually we want to put the data from each year together, so we need the same variables for each year. Thus we need to add variables that just say `NA` for the brand and flavor data for this year.

To change the existing variables and create new variabels, we will use the `mutate` function of the `dplyr` package.

```{r}
nyts_data[["nyts2015"]] <- nyts_data[["nyts2015"]] %>%
  dplyr::mutate(
           brand_ecig=NA,
           menthol=NA,
           clove_spice=NA,
           fruit=NA,
           chocolate=NA,
           alcoholic_drink=NA,
           candy_dessert_sweets=NA,
           other=NA,
           no_use=NA) 
```


Now we can see how the data has changed:

#### {.scrollable }
```{r}
dplyr::glimpse(nyts_data[["nyts2015"]])
```
####


For the next years, we do have flavor questions, but not a brand question. We can replace these vague question names like `Q50A` with values like `menthol`.  The next 3 years have the same structure for these questions. We will also create a brand variable that is just `NA` values.

```{r}

Update_survey <- function(x) { x %>%
      rename(Age=Q1,
           Sex=Q2,
           Grade=Q3,
           menthol=Q50A,
           clove_spice=Q50B,
           fruit=Q50C,
           chocolate=Q50D,
           alcoholic_drink=Q50E,
           candy_dessert_sweets=Q50F,
           other=Q50G,
           no_use=Q50H) %>%
    mutate(brand_ecig=NA)
}

#few options to apply to the data:
#nyts_data <-nyts_data %>% map_at(vars(nyts2016, nyts2017, nyts2018), Update_survey)
#nyts_data <-nyts_data %>% map_at(c("nyts2016", "nyts2017", "nyts2018"), Update_survey)
nyts_data <-nyts_data %>% map_at(vars(-nyts2015, -nyts2019), Update_survey)
```

The final year has a slightly different data structure - It actually has a brand_ecig variable.

```{r}
nyts_data[["nyts2019"]] <- nyts_data[["nyts2019"]] %>%
    rename(brand_ecig=Q40,
           Age=Q1,
           Sex=Q2,
           Grade=Q3,
           menthol=Q62A,
           clove_spice=Q62B,
           fruit=Q62C,
           chocolate=Q62D,
           alcoholic_drink=Q62E,
           candy_dessert_sweets=Q62F,
           other=Q62G) %>%
    mutate(no_use="missing")
```


### Updating Values

We also  want to replace the value of `19` for `Age` to be `">18"` and the value of `13` for `Grade` to be replaced with `"Ungraded/Other"` Also accroding to the codebooks, numeric values of `1` indicate a survey answer of `FALSE`, while a value of `2` indicates `TRUE`. `Sex` needs to be recoded, but it is a bit unclear if it is encoded slightly differently across years.


Let's create a function to make all these updates except for `Sex`:
```{r}
Update_values <- function(x) { x %>%
      mutate(Age=as.numeric(Age)+8,
             Grade=as.numeric(Grade)+5)%>%
      mutate(Age=as.factor(Age),
         Grade=as.factor(Grade),
         ) %>%
  mutate_all(~ replace(., . %in% c("*", "**"), NA))%>%
  mutate(Age=recode(Age,
                    `19` = ">18",
                    ),
         Grade=recode(Grade,
                      `13` = "Ungraded/Other")) %>%
  mutate_at(vars(starts_with("E", ignore.case = FALSE),
                   starts_with("C", ignore.case = FALSE)),
              list(~recode(.,
                           `1` = TRUE,
                           `2` = FALSE,
                           .default = NA,
                           .missing = NA)))
}

nyts_data <-nyts_data %>% map(.,Update_values)

```


Now let's update the `Sex` encoding:

It's a bit difficult to tell if the encoding was the same across years, so let's check using the `count()` function of the `dplyr` package.

According to the codebook, we should have:  
1) 8,958 males in 2015 
2) 10,438 males in 2016 
3) 8,881 males in 2017  
4) 10,069 males in 2018  
5) 9,803 males in 2019  

```{r}
count_sex <-function(x){x %>%count(Sex)}
nyts_data %>% map(., count_sex)
```


Thus, it looks like males were encoded as `1` for each year.

1) 8,958 males in 2015 where 1 = male  
2) 10,438 males in 2016 where 1 = male
3) 8,881 males in 2017 where 1 = male  
4) 10,069 males in 2018 where 1 = male
5) 9,803 males in 2019 where 1 = male


```{r}

  Update_sex <- function(x){x %>%
      mutate(Sex = as.factor(Sex)) %>%
         mutate(Sex = recode(Sex,
                      `1`= "male",
                      `2` = "female"))
  }

nyts_data <-nyts_data %>% map(., Update_sex)
count_sex <-function(x){x %>%count(Sex)}
nyts_data %>% map(., count_sex)

```

Looks good!



The years (2016-2019) that have flavors also need the flavor data to be logical:

```{r}
Update_flavors <- function(x){x %>%
   mutate_at(vars(menthol:no_use),
              list(~recode(.,
                           `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) }

nyts_data  <- nyts_data  %>% map_at(vars(-nyts2015), Update_flavors)

```


Now there are just a few changes needed that are specific to 2019:

```{r}
nyts_data[["nyts2019"]] <- nyts_data[["nyts2019"]] %>%
  mutate_all(~ replace(., . %in% c(".N",".S",".Z"), NA)) %>%
  mutate(psu=as.character(psu)) %>%
  mutate(brand_ecig = recode(brand_ecig,
                                             `1` = "Other", #levels 1,8 combined to `Other` 
                                             `2` = "Blu",
                                             `3` = "JUUL",
                                             `4` = "Logic",
                                             `5` = "MarkTen",
                                             `6` = "NJOY",
                                             `7` = "Vuse",
                                             `8` = "Other"))

```

Great! Now we have all the same variables and our values don't need to be handled any differently for any of the years. Thus we can combine the data across years.

```{r}
nyts_data <- nyts_data %>%
  map_df(bind_rows, .id = "year")

head(nyts_data)
```

Looks like we just need to remove `"nyts"` from the year variable that we created from the names of the tibbles in our list.

```{r}
nyts_data <- nyts_data %>%
  mutate(year=as.numeric(str_remove(year,"nyts")))

```

Here is our clean and wrangled data:
```{r}
head(nyts_data)

```

<style>
div.blue { background-color:#e6f0ff; border-radius: 5px; padding: 20px;}
</style>
<div class = "blue">

Reminder: Current users are a subset of ever users. 

</div>






## **Data Visualization**
*** 

### Question 1 

For lots of pivot_longer() that are the same... can use pivot_longer_spec()

spec <- relig_income %>% build_longer_spec(
  cols = -religion,
  names_to = "income",
  values_to = "count"
)
pivot_longer_spec(relig_income, spec)

But need to decide if we are going to create some new data frames...

```{r}
plot1 <- nyts_data %>%
    mutate(tobacco_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE),
           tobacco_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(tobacco_ever = case_when(tobacco_sum_ever > 0 ~ TRUE,
                                    tobacco_sum_ever ==0 ~ FALSE),
           tobacco_current = case_when(tobacco_sum_current > 0 ~ TRUE,
                                    tobacco_sum_current ==0 ~ FALSE)) %>%
    group_by(year) %>%
    summarise(tobacco_ever_year=(sum(tobacco_ever, na.rm = TRUE)*100)/
                sum(!is.na(tobacco_ever)),
              tobacco_current_year=(sum(tobacco_current, na.rm = TRUE)*100)/
                sum(!is.na(tobacco_current))) %>%
    rename("Ever"=tobacco_ever_year,
           "Current"=tobacco_current_year) %>%
  pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students")%>%
    ggplot(aes(x=year,y=`Percentage of students`, linetype=User)) +
    geom_line() + 
  geom_point(show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
    scale_y_continuous(breaks = seq(0,70,by=10),
                       labels = seq(0,70,by=10),
                       limits = c(0,70)) +
    theme_minimal() +
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How does nicotine use vary over the years?",
         subtitle = "Current and ever users of nicotine products",
         y = "% of students")

plot1 
```

### Question 2

```{r}
plot2 <- nyts_data %>%
    group_by(year,
             Sex) %>%
    summarise(EELCIGT_year=(sum(EELCIGT, na.rm = TRUE)*100)/
                sum(!is.na(EELCIGT)),
              CELCIGT_year=(sum(CELCIGT, na.rm = TRUE)*100)/
                sum(!is.na(CELCIGT))) %>% 
    filter(!is.na(Sex)) %>%
    rename("E-cigarettes, Ever"=EELCIGT_year,
           "E-cigarettes, Current"=CELCIGT_year) %>%
  #converting all columns between and including `E-cigarettes, Ever` and `E-cigarettes, Current` into one column called category
    pivot_longer(cols = `E-cigarettes, Ever`:`E-cigarettes, Current`, names_to = "Category", values_to = "Percentage of students")%>%
    mutate(User = case_when(Category == "E-cigarettes, Ever" ~ "Ever",
                               Category == "E-cigarettes, Current" ~ "Current")) %>%
    # mutate(Sex = case_when(female == TRUE ~ "Females",
    #                            female == FALSE ~ "Males")) %>%
    ggplot(aes(x=year,y=`Percentage of students`, color=Sex, linetype=User)) +
    geom_line() + 
  geom_point(show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
    theme_minimal() +
    theme(legend.position = "bottom",
          axis.text.x = element_text(angle = 90),
          axis.title.x = element_blank()) +
    labs(title = "How do vaping rates compare between males and females?",
         subtitle = "Current and ever users by gender",
         y = "% of students")

plot2
```

### Question 3

What vaping brands and flavors appear to be used the most frequently?

```{r, echo=FALSE, fig.cap="Huang J, Duan Z, Kwok J, et al. Tob Control 2019;28:146–151.", out.width = '100%'}
knitr::include_graphics(here::here("img", "HuangJ_DuanZ_KwokJ_et_al_TobaccoControl_Figure1.png"))
```

[Paper](https://tobaccocontrol.bmj.com/content/tobaccocontrol/28/2/146.full.pdf)


```{r}
plot3 <- nyts_data %>%
    filter(year==2019) %>%
    group_by(brand_ecig) %>%
    filter(!is.na(brand_ecig)) %>%
    summarise(n = n()) %>%
    mutate(total = sum(n),
           Percent = n*100/total) %>%
    mutate(brand_ecig = fct_reorder(brand_ecig, desc(Percent))) %>%
    ggplot(aes(x=brand_ecig,y=Percent, fill=brand_ecig)) +
    geom_bar(stat="identity") +
    theme_minimal() +
    theme(legend.position = "none",
          axis.text.x = element_text(size=10),
          axis.title.x = element_blank()) +
    labs(title = "What vaping brands appear to be used the most frequently?",
         subtitle = "Brand of e-cigarette most frequently used in the last 30 days (2019)",
         y = "% of e-cigarette users responding")

plot3
```

```{r}
plot4 <- nyts_data %>%
  filter(year!=2015) %>%
  group_by(year) %>%
  summarise(Menthol=(sum(menthol, na.rm = TRUE)*100)/
                sum(!is.na(menthol)),
              `Clove or Spice`=(sum(clove_spice, na.rm = TRUE)*100)/
                sum(!is.na(clove_spice)),
              `Fruit`=(sum(fruit, na.rm = TRUE)*100)/
                sum(!is.na(fruit)),
              `Chocolate`=(sum(chocolate, na.rm = TRUE)*100)/
                sum(!is.na(chocolate)),
              `Alcoholic Drink`=(sum(alcoholic_drink, na.rm = TRUE)*100)/
                sum(!is.na(alcoholic_drink)),
              `Candy/Desserts/Sweets`=(sum(candy_dessert_sweets, na.rm = TRUE)*100)/sum(!is.na(candy_dessert_sweets))) %>%
  pivot_longer(cols = -year, names_to = "Percentage of students", values_to = "Flavor") %>%

  ggplot(aes(x=year, y=`Percentage of students`, color=Flavor)) +
  geom_line() +
  geom_point(show.legend = FALSE) +
  theme_minimal() +
  theme(legend.position = "bottom",
          axis.text.x = element_text(angle = 90),
          axis.title.x = element_blank()) + 
  labs(title = "What flavors appear to be used the most frequently in nicotine products?",
       subtitle = "Flavors of tobacco products used in the past 30 days")

plot4 
```

```{r}
plot5 <- nyts_data %>%
  filter(year!=2015) %>%
  filter(menthol==TRUE|
           clove_spice==TRUE|
           fruit==TRUE|
           chocolate==TRUE|
           alcoholic_drink==TRUE|
           candy_dessert_sweets==TRUE|
           other==TRUE) %>%
  mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  mutate(ecig_only_ever = case_when(ecig_ever == TRUE &
                                      non_ecig_ever ==FALSE ~ TRUE,
                                    TRUE ~ FALSE),
           ecig_only_current = case_when(ecig_current == TRUE &
                                           non_ecig_ever ==FALSE ~ TRUE,
                                    TRUE ~ FALSE),
           non_ecig_only_ever = case_when(non_ecig_ever == TRUE &
                                            ecig_ever ==FALSE ~ TRUE,
                                    TRUE ~ FALSE),
           non_ecig_only_current = case_when(non_ecig_current == TRUE &
                                               ecig_ever ==FALSE ~ TRUE,
                                    TRUE ~ FALSE)) %>%
  mutate(Group = case_when(ecig_only_ever==TRUE |
                             ecig_only_current==TRUE ~ "Only e-cigarettes",
                         non_ecig_only_ever==TRUE |
                           non_ecig_only_current==TRUE ~ "Only other products",
                                    TRUE ~ "Both")) %>%
  filter(Group!="Both") %>%
  group_by(year, Group) %>%
  summarise(`Menthol`=(sum(menthol, na.rm = TRUE)*100)/
              sum(!is.na(menthol)),
              `Clove or Spice`=(sum(clove_spice, na.rm = TRUE)*100)/
              sum(!is.na(clove_spice)),
              `Fruit`=(sum(fruit, na.rm = TRUE)*100)/sum(!is.na(fruit)),
              `Chocolate`=(sum(chocolate, na.rm = TRUE)*100)/
              sum(!is.na(chocolate)),
              `Alcoholic Drink`=(sum(alcoholic_drink, na.rm = TRUE)*100)/
              sum(!is.na(alcoholic_drink)),
              `Candy/Desserts/Sweets`=(sum(candy_dessert_sweets, na.rm = TRUE)*100)/
              sum(!is.na(candy_dessert_sweets)),
            `Other`=(sum(other, na.rm = TRUE)*100)/
              sum(!is.na(other)),
            Respondents=n()) %>%
  #converting all columns between and including Menthol and Other to one column called Flavor
  pivot_longer(cols = Menthol:Other, names_to = "Flavor", values_to = "Percentage of students") %>%
  filter(!is.na(`Percentage of students`),
         Flavor!="Other") %>%
  group_by(Flavor) %>%
  mutate(affirmative=(Respondents * `Percentage of students`)/100) %>%
  mutate(flavor_mean = sum(affirmative)/sum(Respondents)) %>%
  ungroup() %>%
  mutate(flavor_mean_rank = dense_rank(flavor_mean),
         Flavor = fct_reorder(Flavor, flavor_mean_rank)) %>%
  ggplot(aes(x=year, y=`Percentage of students`, color=Group)) +
  facet_wrap(.~Flavor,ncol=3) +
  geom_line() + 
  geom_point(show.legend = FALSE) + 
  theme_minimal() +
  theme(legend.position = "bottom",
          axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90)) + 
  labs(title = "Among users of only one type of product, what vaping flavors appear to be used the most frequently?",
       subtitle = "Percent reporting only e-cigarette use vs only other nicotine product use")

plot5
```

### Question 4

Have vaping rates possibly influenced tobacco/nicotine use?

```{r}
plot6 <- nyts_data %>%
    group_by(year) %>%
    summarise(ECIGT_year=(sum(ECIGT, na.rm = TRUE)*100)/
                sum(!is.na(ECIGT)),
              EELCIGT_year=(sum(EELCIGT, na.rm = TRUE)*100)/
                sum(!is.na(EELCIGT)),
              CCIGT_year=(sum(CCIGT, na.rm = TRUE)*100)/
                sum(!is.na(CCIGT)),
              CELCIGT_year=(sum(CELCIGT, na.rm = TRUE)*100)/
                sum(!is.na(CELCIGT))) %>%
    rename("Cigarettes, Ever"=ECIGT_year,
           "E-cigarettes, Ever"=EELCIGT_year,
           "Cigarettes, Current"=CCIGT_year,
           "E-cigarettes, Current"=CELCIGT_year) %>%
  pivot_longer(cols= -year, names_to = "Category", values_to = "Percentage of students")%>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year) %>%
    mutate(User = case_when(Category == "Cigarettes, Ever" ~ "Ever",
                               Category == "E-cigarettes, Ever" ~ "Ever",
                               Category == "Cigarettes, Current" ~ "Current",
                               Category == "E-cigarettes, Current" ~ "Current")) %>%
    mutate(Product = case_when(Category == "Cigarettes, Ever" ~ "Cigarettes",
                               Category == "E-cigarettes, Ever" ~ "E-cigarettes",
                               Category == "Cigarettes, Current" ~ "Cigarettes",
                               Category == "E-cigarettes, Current" ~ "E-cigarettes")) %>%
    ggplot(aes(x=year,y=`Percentage of students`, color=Product, linetype=User)) +
    geom_line() + 
  geom_point(show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
    theme_minimal() +
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How does e-cigarette use compare to cigarette use?",
         subtitle = "Current and ever users of e-cigarettes and cigarettes",
         y = "% of students")

plot6
```

```{r}
plot7 <- nyts_data %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
    group_by(year) %>%
    summarise(ecig_ever_year=(sum(ecig_ever, na.rm = TRUE)*100)/
                sum(!is.na(ecig_ever)),
              ecig_current_year=(sum(ecig_current, na.rm = TRUE)*100)/
                sum(!is.na(ecig_current)),
              non_ecig_ever_year=(sum(non_ecig_ever, na.rm = TRUE)*100)/
                sum(!is.na(non_ecig_ever)),
              non_ecig_current_year=(sum(non_ecig_current, na.rm = TRUE)*100)/
                sum(!is.na(non_ecig_current))) %>%
    pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    mutate(User = case_when(Category =="ecig_ever_year" ~ "Ever",
                           Category =="non_ecig_ever_year" ~ "Ever",
                           Category =="ecig_current_year" ~ "Current",
                           Category =="non_ecig_current_year" ~ "Current")) %>%
    mutate(Product = case_when(Category =="ecig_ever_year" ~ "E-cigarettes",
                           Category =="non_ecig_ever_year" ~ "Other products",
                           Category =="ecig_current_year" ~ "E-cigarettes",
                           Category =="non_ecig_current_year" ~ "Other products")) %>%
    filter(User=="Ever") %>%
    ggplot(aes(x=year,y=`Percentage of students`, color=Product)) +
    geom_line(linetype=1) + # geom_bar(stat="identity", position = "dodge", color="black") +
  geom_point(show.legend = FALSE) +
  scale_y_continuous(breaks = seq(10, 60, by = 10), limits = c(10,60)) +
    theme_minimal() +
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How does e-cigarette ever use compare to ever use of other products over the years?",
         subtitle = "E-cigarette and non-e-cigarette use",
         y = "% of students")

plot7
```

```{r}
plot8 <- nyts_data %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
    group_by(year) %>%
    summarise(ecig_ever_year=(sum(ecig_ever, na.rm = TRUE)*100)/
                sum(!is.na(ecig_ever)),
              ecig_current_year=(sum(ecig_current, na.rm = TRUE)*100)/
                sum(!is.na(ecig_current)),
              non_ecig_ever_year=(sum(non_ecig_ever, na.rm = TRUE)*100)/
                sum(!is.na(non_ecig_ever)),
              non_ecig_current_year=(sum(non_ecig_current, na.rm = TRUE)*100)/
                sum(!is.na(non_ecig_current))) %>%
    pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year) %>%
    mutate(User = case_when(Category =="ecig_ever_year" ~ "Ever",
                           Category =="non_ecig_ever_year" ~ "Ever",
                           Category =="ecig_current_year" ~ "Current",
                           Category =="non_ecig_current_year" ~ "Current")) %>%
    mutate(Product = case_when(Category =="ecig_ever_year" ~ "E-cigarettes",
                           Category =="non_ecig_ever_year" ~ "Other products",
                           Category =="ecig_current_year" ~ "E-cigarettes",
                           Category =="non_ecig_current_year" ~ "Other products")) %>%
    ggplot(aes(x=year,y=`Percentage of students`, color=Product, linetype=User)) +
    geom_line() +
  geom_point(show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
    theme_minimal() +
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How does e-cigarette use compare to use of other products over the years?",
         subtitle = "E-cigarette and non-e-cigarette use",
         y = "% of students")

plot8
```

### Unweighted Sample

```{r, fig.height=10}
plotA_uw <- plot1 +
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
    labs(title = "Nicotine product users more prevalent after 2017",
         subtitle = NULL,
         y = "% of students")

plotB_uw <- plot7 + 
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
    labs(title = "% Ever trying e-cigarettes increases &\n% ever trying other products decreases",
         subtitle = NULL,
         y = "% of students")

plotC_uw <- plot8 + 
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
    labs(title = "% Using e-cigarettes increases &\n% using Other products decreases",
         subtitle = NULL,
         y = "% of students")

title_uw <- ggdraw() + 
  draw_label(
    "Have vaping rates possibly influenced tobacco/nicotine use?",
    fontface = 'bold',
    size=14,
    x = 0,
    hjust = 0
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_uw <- plot_grid(plotA_uw,
                     rel_widths = c(1,1))
plotsBC_uw <- plot_grid(plotB_uw,
                        plotC_uw,
                        rel_widths = c(1,1))

legend_uw <- get_legend(plotB_uw +
                       theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

figure_uw <- plot_grid(title_uw,
                       plotsA_uw,
                       plotsBC_uw,
                       legend_uw,
                       ncol = 1,
                       rel_heights = c(0.1,
                                       1,
                                       1,
                                       0.1),
                       scale = 1.0)

figure_uw
```

### Weighted Sample

```{r}
plotA_w <- nyts_data %>%
    mutate(tobacco_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE),
           tobacco_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(tobacco_ever = case_when(tobacco_sum_ever > 0 ~ TRUE,
                                    tobacco_sum_ever ==0 ~ FALSE),
           tobacco_current = case_when(tobacco_sum_current > 0 ~ TRUE,
                                    tobacco_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt, nest=TRUE) %>%
    group_by(year) %>%
  summarise(tobacco_ever_year = survey_mean(tobacco_ever, vartype = "ci", na.rm=TRUE),
            tobacco_current_year = survey_mean(tobacco_current, vartype = "ci", na.rm=TRUE))  %>%
  mutate_at(vars(-year), "*", 100) %>%
    pivot_longer(cols = -year, names_to = "Type", values_to = "Percentage of students") %>%
    # gather(key=Type,
    #        value=`Percentage of students`,
    #        -year) %>%
  mutate(Estimate = case_when(grepl("_low", Type) ~ "Lower",
                          grepl("_upp", Type) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Type) ~ "Ever",
                          grepl("current", Type) ~ "Current",
                          TRUE ~ "Mean")) %>%
  dplyr::select(-Type) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
  ggplot(aes(x=year,y=Mean)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
    scale_y_continuous(breaks = seq(0,70,by=10),
                       labels = seq(0,70,by=10),
                       limits = c(0,70)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "Nicotine product users more prevalent after 2017",
         y = "% of students")

plotB_w <- nyts_data %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt, nest=TRUE) %>%
    group_by(year) %>%
    summarise(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_at(vars(-year), "*", 100) %>%
  dplyr::select(year,
                ecig_ever_year,
                ecig_current_year,
                non_ecig_ever_year,
                non_ecig_current_year,
                contains("low"),
                contains("upp")) %>%
  pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year)  %>%
  mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("current", Category) ~ "Current",
                          TRUE ~ "Ever",),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
  filter(User=="Ever") %>%
  dplyr::rename("Lower_temp" = Upper,
                "Upper_temp" = Lower) %>%
  dplyr::rename("Lower"=Lower_temp,
                "Upper"=Upper_temp) %>%
    ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(linetype=1) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), show.legend = FALSE) +
  scale_y_continuous(breaks = seq(10, 60, by = 10), limits = c(10,60)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% ever trying e-cigarettes increases &\n% ever trying other products decreases",
         y = "% of students")

#### the wrangling looks the same as the above plot...
plotC_w <- nyts_data %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt, nest=TRUE) %>%
    group_by(year) %>%
  summarise(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_at(vars(-year), "*", 100) %>%
  dplyr::select(year,
                ecig_ever_year,
                ecig_current_year,
                non_ecig_ever_year,
                non_ecig_current_year,
                contains("low"),
                contains("upp")) %>%
  pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year) %>%
    mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Category) ~ "Ever",
                          grepl("current", Category) ~ "Current"),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
    ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% Using e-cigarettes increases &\n% using Other products decreases",
         y = "% of students")

title_w <- ggdraw() + 
  draw_label(
    expression("Have vaping rates possibly influenced tobacco/nicotine use?"),
    fontface = 'bold',
    size=14,
    x = 0,
    hjust = 0
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_w <- plot_grid(plotA_w,
                     rel_widths = c(1),
                     align = "v",
                     axis = "bt")
plotsBC_w <- plot_grid(plotB_w,
                     plotC_w,
                     rel_widths = c(1,1),
                     align = "v",
                     axis = "bt")

legend_w <- get_legend(plotB_w +
                       theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

figure_w <- plot_grid(title_w,
                      plotsA_w,
                      plotsBC_w,
                      legend_w,
                      ncol = 1,
                      rel_heights = c(0.1,
                                      1,
                                      1,
                                      0.1),
                      scale = 1.0)

figure_w
```

### Hypothethical Cohort

```{r}
plotA_w_8 <- nyts_data %>%
  filter((Grade == "8" & year == 2015) |
         (Grade == "9" & year == 2016) |
         (Grade == "10" & year == 2017) |
         (Grade == "11" & year == 2018) |
          (Grade == "12" & year == 2019) 
         ) %>%
    mutate(tobacco_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE),
           tobacco_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(tobacco_ever = case_when(tobacco_sum_ever > 0 ~ TRUE,
                                    tobacco_sum_ever ==0 ~ FALSE),
           tobacco_current = case_when(tobacco_sum_current > 0 ~ TRUE,
                                    tobacco_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt) %>%
    group_by(year) %>%
  summarise(tobacco_ever_year = survey_mean(tobacco_ever, vartype = "ci", na.rm=TRUE),
            tobacco_current_year = survey_mean(tobacco_current, vartype = "ci", na.rm=TRUE))  %>%
  mutate_at(vars(-year), "*", 100) %>%
    pivot_longer(cols = -year, names_to = "Type", values_to = "Percentage of students")%>%
    # gather(key=Type,
    #        value=`Percentage of students`,
    #        -year) %>%
  mutate(Estimate = case_when(grepl("_low", Type) ~ "Lower",
                          grepl("_upp", Type) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Type) ~ "Ever",
                          grepl("current", Type) ~ "Current",
                          TRUE ~ "Mean")) %>%
  dplyr::select(-Type) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
  ggplot(aes(x=year,y=Mean)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper)) + 
  scale_linetype_manual(values = c(2,1)) +
    scale_y_continuous(breaks = seq(0,70,by=10),
                       labels = seq(0,70,by=10),
                       limits = c(0,70)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "Nicotine product users becoming increasingly prevalent",
         y = "% of students")

plotB_w_8 <- nyts_data %>%
  filter((Grade == "8" & year == 2015) |
         (Grade == "9" & year == 2016) |
         (Grade == "10" & year == 2017) |
         (Grade == "11" & year == 2018) |
          (Grade == "12" & year == 2019) 
         ) %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt) %>%
    group_by(year) %>%
    summarise(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_at(vars(-year), "*", 100) %>%
  dplyr::select(year,
                ecig_ever_year,
                ecig_current_year,
                non_ecig_ever_year,
                non_ecig_current_year,
                contains("low"),
                contains("upp")) %>%
   pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students")%>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year)  %>%
  mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("current", Category) ~ "Current",
                          TRUE ~ "Ever",),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
  filter(User=="Ever") %>%
  dplyr::rename("Lower_temp" = Upper,
                "Upper_temp" = Lower) %>%
  dplyr::rename("Lower"=Lower_temp,
                "Upper"=Upper_temp) %>%
  ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(linetype=1) +
  geom_linerange(aes(ymin = Lower, ymax = Upper)) + 
  scale_y_continuous(breaks = seq(10, 60, by = 10), limits = c(10,60)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% ever trying nicotine products increases",
         y = "% of students")

plotC_w_8 <- nyts_data %>%
  filter((Grade == "8" & year == 2015) |
         (Grade == "9" & year == 2016) |
         (Grade == "10" & year == 2017) |
         (Grade == "11" & year == 2018) |
          (Grade == "12" & year == 2019) 
         ) %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt) %>%
    group_by(year) %>%
  summarise(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_at(vars(-year), "*", 100) %>%
  dplyr::select(year,
                ecig_ever_year,
                ecig_current_year,
                non_ecig_ever_year,
                non_ecig_current_year,
                contains("low"),
                contains("upp")) %>%
  pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students")%>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year) %>%
    mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Category) ~ "Ever",
                          grepl("current", Category) ~ "Current"),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
    ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper)) + 
  scale_linetype_manual(values = c(2,1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "E-cigarette use surpasses use of other nicotine products",
         y = "% of students")

title_w_8 <- ggdraw() + 
  draw_label(
    expression("Among"~8^th~"graders in 2015, have vaping rates possibly influenced tobacco/nicotine use?"),
    fontface = 'bold',
    size=14,
    x = 0,
    hjust = 0
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_w_8 <- plot_grid(plotA_w_8,
                        rel_widths = c(1),
                        align = "v",
                        axis = "bt")

plotsBC_w_8 <- plot_grid(plotB_w_8,
                         plotC_w_8,
                         rel_widths = c(1,1),
                         axis = "bt")

legend_w_8 <- get_legend(plotB_w_8 +
                       theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

figure_w_8 <- plot_grid(title_w_8,
                        plotsA_w_8,
                        plotsBC_w_8,
                        legend_w_8,
                        ncol = 1,
                        rel_heights = c(0.1,
                                      1,
                                      1,
                                      0.1),
                        scale = 1.0
)

figure_w_8
```

### Final Figure

```{r}
title_final <- ggdraw() +
  draw_label(
    expression("Have vaping rates possibly influenced tobacco/nicotine use?"),
    fontface = 'bold',
    size=16,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_uw_final <- ggdraw() + 
  draw_label(
    expression(~6^th~"-"~12^th~"graders, unweighted"),
    size=12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_w_final <- ggdraw() + 
  draw_label(
    expression(~6^th~"-"~12^th~"graders, weighted"),
    fontface = 'bold',
    size=12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_w_8_final <- ggdraw() + 
  draw_label(
    expression(~8^th~"graders in 2015, weighted"),
    fontface = 'bold',
    size=12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_final <- plot_grid(subtitle_uw_final,
                            subtitle_w_final,
                            subtitle_w_8_final,
                            ncol = 3)

plotsA_title_final <- ggdraw() + 
  draw_label(
    expression("Prevalence of e-cigarette use by user type"),
    size=14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_final <- plot_grid(plotA_uw + theme(plot.title = element_blank()),
                          plotA_w + theme(plot.title = element_blank()),
                          plotA_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

plotsB_title_final <- ggdraw() + 
  draw_label(
    expression("Prevalence of ever use by product type"),
    size=14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsB_final <- plot_grid(plotB_uw + theme(plot.title = element_blank()),
                          plotB_w + theme(plot.title = element_blank()),
                          plotB_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

plotsC_title_final <- ggdraw() + 
  draw_label(
    expression("Prevalence of nicotine product use by product & user type"),
    size=14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsC_final <- plot_grid(plotC_uw + theme(plot.title = element_blank()),
                          plotC_w + theme(plot.title = element_blank()),
                          plotC_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

legend_final <- get_legend(plotB_w +
                             theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

final_plot <- plot_grid(title_final,
          plotsA_title_final,
          subtitle_final,
          plotsA_final,
          plotsB_title_final,
          subtitle_final,
          plotsB_final,
          plotsC_title_final,
          subtitle_final,
          plotsC_final,
          legend_final,
          ncol = 1,
          rel_heights = c(0.2,
                          0.2,
                          0.1,
                          1,
                          0.2,
                          0.1,
                          1,
                          0.2,
                          0.1,
                          1,
                          0.1))

final_plot
```

```{r, echo=FALSE, include=FALSE}
ggsave("final_plot.png")
```

## **Suggested Homework**
*** 

<style>
div.blue { background-color:#e6f0ff; border-radius: 5px; padding: 20px;}
</style>
<div class = "blue">

+ Apply survey weights to one of the figures produced in this case study in which weighted estimates were not produced. Include error bars in the updated figure.
    + Does the figure change after the application of survey weights?
    + If so, describe how. 
+ Reproduce `final_plot` above for a different cohort of your choice.

</div>

### Notes

Ever and current variables are limited to those shared by all years of data included in this case study.

```{r, echo=FALSE}
knit_time_end <- Sys.time()
```

```{r, echo=FALSE}
knit_time <- knit_time_end - knit_time_start
knit_time_message <- paste("Knit time:",
      round(as.numeric(knit_time) ,3),
      units(knit_time)
      )
```

<p style="color:red">New code: `r knit_time_message`. Previous code: ~ 3 m. </p>

### Problems

I had difficulty producing a plot that succinctly presented a trend. It's very easy to produce plots that are very useful once one is familiar with the data. Some plots, however, cannot stand alone and need additional context to be clear for those without prior knowledge of the data. When I first shared a plot I had been working on with others, it became clear that in my effort to present a complicated idea briefly I had left out information that would make the trend easily interpretable. To solve this issue, I began to present visualizations of the data alongside my original plot. The final figure I created contained several additional plots, each presenting the same trend at a different level than my initial plot.

My "centerpiece" plot is the middle plot in `final_plot`. The 8 plots around it help provide a very clear picture of what is going on in the US with regards to e-cigarette use and nicotine product use at large. On its own, it's difficult to understand the trends in the US and how important the weighting scheme is for inference. Once you add the left and right columns, it's clear what is going on. 



## **Data Analysis**
*** 

### **content header**
*** 



## **Summary**
*** 



## **Helpful Links**
*** 

review of [tidymodels](https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/){target="_blank"} 

guide for [preprocessing with recipes](http://www.rebeccabarter.com/blog/2019-06-06_pre_processing/)


## **Session info**
***

```{r}
sessionInfo()
```